Scrapy Guide

Scrapy is a Python framework for building web crawlers. Unlike requests + BeautifulSoup, Scrapy handles everything — sending requests, parsing HTML, following links, and saving data — all in one structured project.

Use Scrapy when your crawler needs to:

  • Handle many pages or multiple URLs
  • Follow links automatically
  • Save data to files or databases in a consistent way

How Scrapy Works

The Spider defines what to crawl and how to parse it. The Engine handles scheduling and data flow. The Pipeline processes and saves the scraped data.


Installation

pip install scrapy

Verify the installation:

scrapy version

Create a Scrapy Project

scrapy startproject my_crawler
cd my_crawler

Project Structure

my_crawler/
├── scrapy.cfg
└── my_crawler/
    ├── __init__.py
    ├── items.py          # Define data structure
    ├── middlewares.py    # Custom request/response hooks
    ├── pipelines.py      # Process and save scraped data
    ├── settings.py       # Project configuration
    └── spiders/          # Put your spiders here
        └── __init__.py

The most important files for beginners are:

  • spiders/: Where you write your crawling logic
  • items.py: Define the fields you want to collect
  • pipelines.py: Process the collected data (e.g., save to CSV, database)
  • settings.py: Control crawl speed, headers, pipelines, etc.

Create Your First Spider

Create a file inside spiders/ — for example, spiders/quotes_spider.py:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # Unique name for this spider
    start_urls = [
        "https://quotes.toscrape.com/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Follow the "next" page link
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

Key parts of a spider:

  • name: Used to run the spider from the terminal
  • start_urls: The pages Scrapy will crawl first
  • parse(): Called for each response — extract data and follow links here

CSS Selectors in Scrapy

Scrapy uses CSS selectors (similar to BeautifulSoup) to extract data from HTML.

| Method | Description |
| --- | --- |
| response.css("tag::text").get() | Get text from the first matching element |
| response.css("tag::text").getall() | Get text from all matching elements |
| response.css("tag::attr(href)").get() | Get an attribute value |
| response.css("tag") | Select elements (returns a SelectorList) |

Example

# HTML: <a href="/page/2" class="next">Next</a>
response.css("a.next::attr(href)").get() # → "/page/2"
response.css("a.next::text").get() # → "Next"

Run the Spider

scrapy crawl quotes

Save Output to a File

# Save as JSON
scrapy crawl quotes -o output.json

# Save as CSV
scrapy crawl quotes -o output.csv
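If you always want the same output file, you can configure it once in settings.py with the FEEDS setting instead of passing -o on every run. A sketch (FEEDS is available in recent Scrapy versions; the filename here is illustrative):

```python
# settings.py — export items automatically on every run
FEEDS = {
    "output.json": {
        "format": "json",
        "overwrite": True,  # replace the file instead of appending
    },
}
```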

Use Scrapy Shell to Debug

Before writing your spider, use the Scrapy shell to test your selectors interactively.

scrapy shell "https://quotes.toscrape.com/"

Inside the shell:

# Test selectors
response.css("div.quote span.text::text").getall()
response.css("small.author::text").getall()

This saves a lot of time — you can confirm the selector works before putting it in your spider.


Custom Headers and Settings

Edit settings.py to configure the crawler:

# Slow down requests to be polite
DOWNLOAD_DELAY = 1 # seconds between requests

# Set a custom User-Agent
USER_AGENT = "Mozilla/5.0 (compatible; MyBot/1.0)"
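A few other settings beginners commonly adjust, sketched with illustrative values:

```python
# settings.py — other commonly adjusted options (values are illustrative)
ROBOTSTXT_OBEY = True        # respect robots.txt (the default in new projects)
CONCURRENT_REQUESTS = 8      # cap simultaneous requests (Scrapy's default is 16)
DEFAULT_REQUEST_HEADERS = {
    "Accept-Language": "en",
}
```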

For pipeline configuration, see Scrapy Pipeline.


Define Items

Using items.py makes your data structure explicit and easier to reuse.

# items.py
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()

In your spider, yield the item instead of a plain dict:

from my_crawler.items import QuoteItem

# Inside QuotesSpider:
def parse(self, response):
    for quote in response.css("div.quote"):
        item = QuoteItem()
        item["text"] = quote.css("span.text::text").get()
        item["author"] = quote.css("small.author::text").get()
        yield item

Scrapy vs requests + BeautifulSoup

| Feature | requests + BS4 | Scrapy |
| --- | --- | --- |
| Setup | Simple | Requires project structure |
| Multi-page crawling | Manual loop | Built-in with follow() |
| Concurrency | No (sequential) | Yes (async by default) |
| Data pipeline | Manual | Built-in pipeline |
| Best for | Small, one-off scripts | Large or recurring crawlers |

Use requests + BeautifulSoup for quick scripts. Use Scrapy when you need something more structured and scalable.